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Abstract 

Many inference problems in structured pre¬ 
diction are naturally solved by augmenting a 
tractable dependency structure with complex, 
non-local auxiliary objectives. This includes 
the mean field family of variational inference 
algorithms, soft- or hard-constrained inference 
using Lagrangian relaxation or linear program¬ 
ming, collective graphical models, and forms 
of semi-supervised learning such as posterior 
regularization. We present a method to dis- 
criminatively learn broad families of inference 
objectives, capturing powerful non-local statis¬ 
tics of the latent variables, while maintain¬ 
ing tractable and provably fast inference using 
non-Euclidean projected gradient descent with a 
distance-generating function given by the Bethe 
entropy. We demonstrate the performance and 
flexibility of our method by (1) extracting struc¬ 
tured citations from research papers by learning 
soft global constraints, (2) achieving state-of-the- 
art results on a widely-used handwriting recog¬ 
nition task using a novel learned non-convex 
inference procedure, and (3) providing a fast 
and highly scalable algorithm for the challeng¬ 
ing problem of inference in a collective graphical 
model applied to bird migration. 

1 INTRODUCTION 

Structured prediction has shown great success in modeling 
problems with complex dependencies between output vari¬ 
ables. Practitioners often use undirected graphical models, 
which encode conditional dependency relationships via a 
graph. However, the tractability of exact inference in these 
models is limited by the graph’s treewidth , often yielding a 
harsh tradeoff between model expressivity and tractability. 

Equal contribution. 


Graphical models are good at modeling local dependencies 
between variables, such as the importance of surrounding 
context in determining the meaning of words or phrases. 
However, their sensitivity to cyclic dependencies often ren¬ 
ders them unsuitable for modeling preferences for certain 
globally consistent states. For example, in the canonical 
NLP task of part-of-speech tagging, there is no clear way 
to enforce the constraint that every sentence have at least 
one verb without increasing the likelihood that every token 
is predicted to be a verb. 

Concretely, exact marginal inference in a discrete graphical 
model can be posed as the following optimization problem 

fi* = argmin - H(fi) - (9,fj.) , (1) 

where ji is a concatenated vector of node and clique 
marginals, ff(/x) is the entropy, A4 is the marginal poly¬ 
tope, and 9 are parameters. Here we face a tradeoff: adding 
long-range dependencies directly to the model increases 
the clique size and thus the complexity of the problem and 
size of //, rendering inference intractable. However, the 
linear scoring function 9 breaks down over cliques, pre¬ 
venting us from enforcing global regularities in any other 
way. In this work, we propose to augment the inference 
objective 0 and instead optimize 

H* = argmin -H(n) - (0,/z) + (2) 

riGAt 

Here, is some arbitrary parametric function of the en¬ 
tire concatenated marginal vector, where ip may depend on 
input features. Since L^ is non-linear, it can enforce many 
types of non-local properties. Interestingly, whenever L^ 
is convex, and whenever inference is easy in the underlying 
model, i.e., solving 0 is tractable, we can solve 0 using 
non-Euclidean projected gradient methods using the Bethe 
entropy as a distance-generating function. Unlike many 
message-passing algorithms, our procedure maintains pri¬ 
mal feasibility across iterations, allowing its use as an any¬ 
time algorithm. Furthermore, for non-convex L^, we also 
show convergence to a local optimum of 0. Finally, we 




present algorithms for discriminative learning of the pa¬ 
rameters ip. In a slight abuse of terminology, we call L^ a 
non-local energy function. 


Ours is not the first work to consider modeling global pref¬ 
erences by augmenting a tractable base inference objec¬ 
tive with non-local terms. For example, generalized mean- 
field variational inference algorithms augment a tractable 
distribution (the Q distribution) with a non-linear, non- 
convex global energy function that scores terms in the full 
model (the P distribution) using products of marginals of 
Q ( [Wainwright & Jordan 2008j l. This is one special case 
of our non-local inference framework, and we present algo¬ 
rithms for solving the problem for much more general Ly,, 
with compelling applications. 


Additionally, the modeling utility provided by global pref¬ 
erences has motivated work in dual decomposition, where 
inference in loopy or globally-constrained models is de¬ 
composed into repeated calls to inference in tractable in¬ 
dependent subproblems (Komodakis et al. 2007| [Sontag 


|et al.||20TT] l. It has seen wide success due to its ease of im¬ 
plementation, since it reuses existing inference routines as 
black boxes. However, the technique is restricted to mod¬ 
eling linear constraints, imposed a priori. Similarly, these 
types of constraints have also been imposed on expecta¬ 
tions of the posterior distribution for use in semi-supervised 
learning, as in posterior regularization and generalized 
expectation (Ganchev et al.| |2010| |Mann &~ McCallum 


20101. In contrast, our methods are designed to discrim - 


inatively learn expressive inference procedures, with min¬ 
imal domain knowledge required, rather than regularizing 
inference and learning. 


2 BACKGROUND 


Let y = (y-[,... ,y n ) denote a set of discrete variables 
and x be a collection of input features. We define the 
conditional distribution Pg( y|x) = exp((0(x), Spy)))/Z, 
where Spy) is a mapping from y to a set of sufficient 
statistics, (Lx) is a differentiable vector-valued mapping, 
and Z = Yl y exp((0, .S'(y))). Conditional random fields 
(CRFs) assume that (j/i,..., y n ) are given a graph struc¬ 
ture and Spy) maps y to a 0-1 vector capturing joint set¬ 
tings of each clique (Lafferty et alT] |200 1 1 > . Going forward, 
we often suppress the explicit dependency of 9 on x. For 
fixed 9, the model is called a Markov random field (MRF). 


Given a distribution Ppy), define the expected sufficient 
statistics operator p{P) = Ep[S'(y)]. For the CRF statis¬ 
tics Spy) above, p is a concatenated vector of node and 
clique marginals. Therefore, marginal inference, the task 
of finding the marginal distribution of Pg(y|x) over y, is 
equivalent to computing the expectation p(Pg(y\x.)). 


For tree-structured graphical models, Pg{ y|x) i—> 
p(Pg{ y|x)) is a bijection, though this is not true 
for general graphs. Furthermore, for trees the en¬ 
tropy H(Pg( y|x)) is equal to the Bethe entropy 


Hj 3 (p(Pg py\x))), defined, for example, in Wainwright & 


|Jordan| ( |20 08). The marginal polytope A4 is the set of p 
that correspond to some Pg. 


As mentioned in the introduction, marginal inference can 
be posed as the optimization problem 0 - MAP inference 
finds the joint setting y with maximum probability. For 
CRFs, this is equivalent to 


First, we provide efficient algorithms for solving the 
marginal inference problem ([2) and performing MAP pre¬ 
diction in the associated distribution, for both convex and 
non-convex global energy functions. After that, we provide 
a learning algorithm for 9 and the parametrized L,/, func¬ 
tions using an interpretation of ([2]) as approximate varia¬ 
tional inference in a a probabilistic model. All of our algo¬ 
rithms are easy to implement and rely on simple wrappers 
around black-box inference subroutines. 


arg min (—0(x), Spy)). (3) 

y 

For tree-structured CRFs, marginal and MAP inference can 
be performed efficiently using dynamic programming. Our 
experiments focus on such graphs. However, the inference 
algorithms we present can be extended to general graphs 
wherever marginal inference is tractable using a convex en¬ 
tropy approximation and a local polytope relaxation. 


Our experiments demonstrate the power and generality of 
our approach by achieving state-of-the-art results on sev¬ 
eral tasks. We extract accurate citations from research pa¬ 
pers by learning discriminative global regularities of valid 
outputs, outperforming a strong dual decomposition-based 
baseline (Anz aroot et al.| |2014| >. In a benchmark OCR 
task (Tas kar et al. 2004) 1, we achieve state-of-the-art results 
with a learned non-convex, non-local energy function, that 
guides output decodings to lie near dictionary words. Fi¬ 
nally, our general algorithm for solving 0 provides large 
speed improvements for the challenging task of inference in 
chain-structured collective graphical models (CGMs), ap¬ 
plied to bird migration (Sheldon & Dietterich 201 1]>. 


3 MARGINAL INFERENCE WITH 
NON-LOCAL ENERGIES 

We move beyond the standard inference objective <|T]), aug¬ 
menting it with a non-local energy term as in |2|: 

p* = arg min —iFg(/x) - (9, p) + L^(p). 

M 


Here, Ly, is some arbitrary parametrized function of the 
marginals, and ip may depend on input features x. 

Intuitively, we are augmenting the inference objective 0 
by allowing it to optimize a broader set of tradeoffs - not 







































only between expected node scores, clique scores, and en¬ 
tropy, but also global functions of the marginals. To be con¬ 
crete, in our citation extraction experiments (Section [8. 1| >, 
for example, we employ the simple structure: 

-Mm) = 5^Mj(m)> (4) 

0 

Where each ( 3 is a univariate convex function and each ifj 
is constrained to be non-negative, in order to maintain the 
overall convexity of . We further employ 

^(m) = tj (oJm) , (5) 

where a :/ encodes a ‘linear measurement’ of the marginals 
and £j is some univariate convex function. 

4 VARIATIONAL INTERPRETATION 
AND MAP PREDICTION 

We next provide two complementary interpretations of ([2]) 
as variational inference in a class of tractable probability 
distributions over y. They yield precisely the same vari¬ 
ational expression. However, both are useful because the 
first helps motivate a MAP prediction algorithm, while the 
second helps characterize our learning algorithm in Sec¬ 
tion [7] as (approximate) variational EM. 

Proposition 1. For fixed 9 and Lp, the output p* of in¬ 
ference in the augmented objective 0 is equivalent to the 
output of standard inference 0 in an MRF with the same 
clique structure as our base model, but with a modified pa¬ 
rameter 9 = 9 — VL^,(p*) . 

Proof Forming a Lagrangian for Q, the stationarity con¬ 
ditions with respect to the variable p are: 

0 = -(0 - VL*(/i*)) - ViT 8 (M*) + V m C(m, A), (6) 

where C(p,. A) are collected terms relating to the marginal 
polytope constraints. The proposition follows because © 
is the same as the stationarity conditions for 

p,* = argmin- (9 - VL^(p*),p) - H B (p). □ (7) 


Therefore, we can characterize a joint distribution over y 
by first finding p* by solving ([2]) and then defining an MRF 
over y with parameters 9. Even more conveniently, our 
inference technique in Section [6] iteratively estimates 9 on 
the fly, namely via the dual iterate 9 t in Algorithm[l] 

Ultimately, in many prediction problems we seek a single 
output configuration y rather than an inferred distribution 
over outputs. Proposition [I] suggests a simple prediction 
procedure: first, find the variational distribution over y 
parametrized as an MRF with parameter 9. Then, perform 


MAP in this MRF. Assuming an available marginal infer¬ 
ence routine for this MRF, we assume the tractability of 
MAP - for example using a dynamic program. We avoid 
predicting y by locally maximizing nodes’ marginals, since 
this would not necessarily yield feasible outputs. 


Instead of solving we could have introduced global 
energy terms to the MAP objective ([3]i that act directly 
on values 5(y) rather than on expectations p, as in 0. 
However, this yields a difficult combinatorial optimization 
problem for prediction and does not yield a natural way to 


learn the parametrization of the global energy. Section 8.1 


demonstrates that using energy terms defined on marginals, 
and performing MAP inference in the associated MRF, per¬ 
forms as well or better than an LP technique designed to 
directly perform MAP subject to global penalty terms. 


Our second variational interpretation characterizes p* as a 
variational approximation to a complex joint distribution: 


-P c (y|x) = (l/^e,^)Pe(y|x)P v ,(y|x). (8) 


We assume that isolated marginal inference in P 0 (y|x) is 
tractable, while P^(y|x) is an alternative structured dis¬ 
tribution over y for which we do not have an efficient in¬ 
ference algorithm. Specifically, we assume that 0 can be 
solved for If. Furthemore, we assume that P^(y|x) oc 
exp (L^,(S(y); x)), where P^,(-;x) is a convex function, 
conditional on input features x. Going forward, we will 
often surpress the dependence of L. t p on x. Above, Zq ^ 
is the normalizing constant of the combined distribution. 
Note that if L was linear, inference in both fy (y|x) and 
P c (y|x) would be tractable, since the distribution would 
decompose over the same cliques as P 0 (y|x). 

Not surprisingly, 0 is intractable to reason about, due 
to the non-local terms in 0, so we approximate it with 
a variational distribution Q( y). The connection between 
this variational approximation and Proposition [l]is derived 
in Appendix [A] Here, we assume no clique structure on 
Q( y), but show that minimizing a variational approxima¬ 
tion of KL (Q(y)| |P c (y|x)), for a given x, yields a Q that 
is parametrized compactly as the MRF in Proposition [I] 
We discuss the relationship between this and general mean- 
field inference in Section [5] 


Although the analysis of this section assumes convexity 
of our inference techniques can be applied to non- 
convex L^, as discussed in Section 6.3 and our learning 
algorithm produces state-of-the-art results even in the non- 
convex regime for a benchmark OCR task. 


5 RELATED MODELING TECHNIQUES 


Mean field variational inference in undirected graphical 
models is a particular application of our inference frame¬ 
work, with a non-convex Lp, (Wainwright & Jordan 2008)>. 









The technique estimates marginal properties of a complex 
joint distribution P using the clique marginals fi of some 
tractable base distribution Q , not necessarily fully factor¬ 
ized. This induces a partitioning of the cliques of P into 
those represented directly by p, and those where we define 
clique marginals as a product distribution of the relevant 
nodes’ marginals in //. To account for the energy terms of 
the full model involving cliques absent in the simple base 
model, the energy (0,/x) of the base model is augmented 
with an extra function of fi. 

L(/z) = -W0 c ,(g)/i„\ (9) 

cG C \ n£c / 


where C is the set of cliques not included in the tractable 
sub-model, 9 C are the potentials of the original graphical 
model corresponding to the missing cliques, and (x) ;j //„ 
represents a repeated outer (tensor) product of the node 
marginals for the nodes in those cliques. 

Note L(p) is non-linear and non-convex. Our work gener¬ 
alizes |9]) by allowing arbitrary non-linear interaction terms 
between components of //. This is very powerful - for ex¬ 


ample, in our citation extraction experiments in Section 8.1 


expressing these global terms in a standard graphical model 
would require many factors touching all variables. Lo¬ 
cal coordinate ascent mean-field can be frustrated by these 
rigid global terms. Our gradient-based method avoids these 
issues by updating all marginals simultaneously. 

Dual decomposition is a popular method for performing 
MAP inference in complex structured prediction models 
by leveraging repeated calls to MAP in tractable submod¬ 
els ( |Komodakis et al.||2007[|Sontag et alT| |201 1[ >. The fam¬ 
ily of models solvable with dual decomposition is limited, 
however, because the terms that link the submodels must 
be expressible as linear constraints. Similar MAP tech¬ 
niques (Ravikumar et ak 2010| Martins et al. 2011 Fu & 


Banerjee 2013| l based on the alternating direction method 
of multipliers ( ADMM) can be adapted for marginal infer¬ 
ence, in problems where marginal inference in submodels 
is tractable. However, the non-local terms are defined as 
linear functions on settings of graphical model nodes, while 
our non-linear L^(p) terms provide practitioners with an 
expressive means to learn and enforce regularities of the 
inference output. 

Posterior regularization (PR) (|Ganchev et al.| |2010|>, 


learning from measurements (LFM) Liang et al. 


(2009) 


, and generalized expectations (GE) (Mann & McCal 
|lum| |2010| l, are a family of closely-related techniques for 
performing unsupervised or semi-supervised learning of 
a conditional distribution Pg(y|x) or a generative model 
Pe(x|y) using expectation-maximization (EM), where the 
E-step for latent variables y does not come directly from in¬ 
ference in the model, but instead from projection onto a set 
of expectations obeying global regularity properties. In PR 


and GE, this yields a projection objective of the form ([2}, 
where the terms come from a Lagrangian relaxation of 
regularity constraints, and ip corresponds to dual variables. 
Originally, PR employed linear constraints on marginals, 
but He et al. (2013) extend the framework to arbitrary con¬ 
vex differentiable functions. Similarly, in LFM such an in¬ 
ference problem arises because we perform posterior in¬ 
ference assuming that the observations y have been cor¬ 
rupted under some noise model. Tarlow & Zemel (2012) 
also present a method for learning with certain forms of 
non-local losses in a max-margin framework. 


Our goals are very different than the above learning meth¬ 
ods. We do not impose non-local terms L^ in order to regu¬ 
larize our learning process or allow it to cope with minimal 
annotation. Instead, we use L^ to increase the expressiv¬ 
ity of our model, performing inference for every test ex¬ 
ample, using a different ip, since it depends on input fea¬ 
tures. Since we are effectively ‘learning the regularizer,’ on 
fully-labeled data, our learning approach in Section [7] dif¬ 
fers from these methods. Finally, unlike these frameworks, 
we employ non-convex L^ terms in some of our experi¬ 
ments. The algorithmic consequences of non-convexity are 
discussed in Section [631 


6 OPTIMIZING THE NON-LOCAL 

MARGINAL INLERENCE OBJECTIVE 

We now present an approach to solving 0 using non- 
Euclidean projected gradient methods, which require ac¬ 
cess to a procedure for marginal inference in the base dis¬ 
tribution (which we term the marginal oracle ), as well as 
access to the gradient of the energy function L^. We pose 
these algorithms in the composite minimization framework, 
which gives us access to a wide variety of algorithms that 
are discussed in the supplementary material. 


6.1 CONVEX OPTIMIZATION BACKGROUND 


Before presenting our algorithms, we review several defi¬ 
nitions from convex analysis (Rockafellar 1997). 


We call a function ip a-strongly convex with respect to a 
norm || • ||p, if for all x, y G dom((p), 


V(y) > <P(x) + ^v{x) T (y - z) + || \y~ xf P . 

Proposition 2 (e.g. Beck & Teboulle (2003)). The nega¬ 
tive entropy function —H(x) = is 1-strongly 

convex with respect to the 1-norm || • ||i over the interior of 
the simplex A (restricting dom(H) to int(A)). 


Given a smooth and strongly convex function ip, we can 
also define an associated generalized (asymmetric) distance 
measure called the Bregman divergence (Bregmanl 1967) 















































Algorithm 1 Bethe-RDA 

Input: parameters 9 , energy function L^(p) 
set 9 0 = 9 

set /j,o to prox-center MARGINAL-ORACLE(@o) 

3o = 0 

repeat 

Pt = constant > 0 

9t = ^St-i + |VL(/x t ) 

= 9 t+p t 9t 

Ht = MARGINAL-ORACLE(6»f) 
until CONVERGED (p t , 


generated by p, 

B v (x, x 0 ) = p{x) - p(x 0 ) - (Vp(xo), x — Xq) . 

For example, the KL divergence is the Bregman divergence 
associated to the negative entropy function, and the squared 
Euclidean distance is its own associated divergence. 

Composite minimization ( jPassty | fl 979[ > is a family of tech¬ 
niques for minimizing functions of the form h = f + R, 
where we have an oracle that allows us to compute min¬ 
imizations over R in closed form (usually R here takes 
the form of a regularizer). Problems of this form are often 
solved with an algorithm called proximal gradient , which 
minimizes h(x) over some convex set X using: 

x t + 1 = argmin {Vf(x t ),x) + - a; t ||| + R(x), 

x<ZX 2i) t 

for some decreasing sequence of learning rates rj t - Note 
that because of the requirement x £ X, proximal gradi¬ 
ent generalizes projected gradient descent - since uncon¬ 
strained minimization might take us out of the feasible re¬ 
gion X, computing the update requires projecting onto X. 

But there is no reason to use the squared Euclidean dis¬ 
tance when computing our updates and performing the pro¬ 
jection. In fact, the squared term can be replaced by any 
Bregman divergence. This family of algorithms includes 
the mirror descent and dual averaging algorithms (jBeck & 
|Teboulle| [2003] |Nesterov| |2009) . 

We base our projected inference algorithms on regularized 
dual averaging (RDA) (|Xiaol[20T0|. The updates are: 


6.2 OUR ALGORITHM 

These non-Euclidean proximal algorithms are especially 
helpful when we are unable to compute a projection in 
terms of Euclidean distance, but can do so using a different 
Bregman divergence. We will show that this is exactly the 
case for our problem of projected inference: the marginal 
oracle allows us to project in terms of KL divergence. 

However, to maintain tractability we avoid using the 
entropy function H on the exponentially-large simplex 
A, and instead optimize over the structured, factorized 
marginal polytope M and its corresponding structured 
Bethe entropy fTg. For tree-structured models, H and H g 
have identical values, but different inputs. It remains to 
show the strong convexity of —Hg so we can use it in RDA. 

Proposition 3. For trees with n nodes, the negative Bethe 
entropy function —Hg is |(2n—l) -2 -strongly convex with 
respect to the 2-norm over the interior of the marginal poly¬ 
tope M. 


Proof. Consequence of Lemma 1 in 


Fu & Banerjee ( 2013) >. 


With these definitions in hand, we present Bethe-RDA pro¬ 
jected inference Algorithm [T] This algorithm corresponds 
to instantiating ( flQ| with R = — TTg — ( 9 . p) and p = 
—Hg. Note the simplicity of the algorithm when choos¬ 
ing f3 t = 0. It is intuitively appealing that the algorithm 
amounts to no more than calling our marginal inference or¬ 
acle with iteratively modified parameters. 

Proposition 4. For convex energy functions and convex 
—Hb, the sequence of primal averages ofAlgorithm^con- 
verges to the optimum of the variational objective 0 with 
suboptimality of 0{ ln ^) at time t. 


Proof This follows from Theorem 3 of Xiao ( 2010| along 
with the strong convexity of—fTg. □ 


If we have more structure in the energy functions, specifi¬ 
cally a Lipschitz-continuous gradient, we can modify the 
algorithm to use Nesterov’s acceleration technique and 
achieve a convergence of 0(4-). Details can be found in 
Appendix [D] Additionally, in practice these problems need 
not be solved to optimality and give stable results after a 
few iterations, as demonstrated in Figure [8T| 


x t +i = argmin 

x&X 


( 9 t,x) + —p(x) + R{ x). 


GO) 


6.3 INFERENCE WITH NON-CONVEX, 
NON-LOCAL ENERGIES 


where g t = J V/(xfc) is the average gradient of / en¬ 
countered so far. One benefit of RDA is that it does not re¬ 
quire the use of a learning rate parameter (/3 t = 0) when us¬ 
ing a strongly convex regularizer. RDA can be interpreted 
as doing a projection onto X using the Bregman divergence 
generated by the strongly convex function p + R. 


An analogy can be made here to loopy belief propaga¬ 
tion - even in the case of non-convex loss functions (and 
even non-convex entropy functions with associated inexact 
marginal oracles), the updates of our inference (and learn¬ 
ing) algorithms are well-defined. Importantly, since one of 
our motivations for developing non-local inference was to 
























Algorithm 2 Learning with non-local energies 

Input: examples x,;. y, and inference oracle MARGQ 
for distributions with the clique structure of Pg(y|x). 
Output: parameters (9, f if>) for /' : (y|x). 
repeat 
//E-Step 

for all (Xj, y, : ) do 

Pi <r- (Algorithm [TJ // using 9 , tj) and MARG() 

Pi «— (Proposition |5j // using ip, //, 

// note Qi{yi) is a CRF with potentials 9 + pi. 

end for 

//M-Step (gradient-based learning of CRF parameters) 

repeat 

rrii •<— MARG(Qi) Vj //standard CRF inference 

v e «- E i s (yi) ~ m i 

v*<-Ei3| T ($(*)-"*) 

9 <— Gradient-Step(0, Vo) 
tj) 4— Gradient-Step(r/:, V^,) 
until converged 

until converged OR iter > max Jters 


generalize mean field inference, and the additional penalty 
terms are non-convex in that case, we would like our algo¬ 
rithms to work for the non-convex case as well. 


Unlike loopy belief propagation, however, since we derive 
our algorithms in the framework of composition minimiza¬ 
tion, we have access to a wealth of theoretical guarantees. 
Based on results from the theory of optimization with first- 
order surrogate loss functions ( Mairal)|2013j >, in Appendix 
[C]we propose a small modification to Algorithm |T] with an 
asymptotic convergence condition even for non-convex en¬ 
ergies. In practice we find that the unmodified Algorithm 
|T] also works well for these problems, and experimentally 
in Section [ih2} we see good performance in both inference 
and learning with non-convex energy functions. 


7 LEARNING MODELS WITH 
NON-LOCAL ENERGIES 

We seek to learn the parameters 9 and tj) of the underlying 
CRF base model and L,p, respectively. Let S = { y L , x,} 
be n training examples. Let Q(y,: ft, ) be the variational 
distribution for y, resulting from applying Proposition [T| 
Namely, Q (y, : //,,) is an MRF with parameters 

Pi ■■= 9- (11) 


Algorithm 3 Doubly-stochastic learning with Ly, given by 
a sum of scalar functions of linear measurements ja_ 

Input: examples x, , y, and MARGINAL-ORACLEQ 
for distributions with the clique structure of Pg(y|x). 
Output: parameters {9,ip) forP c (y|x). 
repeat 

sample (xi,yj) randomly 
p i <— (Algorithm [TJ 
Vo <- S(yi) - Pi 

Vejjpija] (S(yi) - p^ 

9 <— Gradient-StepPL Vo) 
tj> e- Gradient-Step(r/: ) V^,) 
until converged OR iter > max_iters 


tj) interacts with the data in a complex manner that prevents 
us from using standard learning techniques for the expo¬ 
nential family. Namely, we can not easily differentiate a 
likelihood with respect to tj), since this requires differenti¬ 
ating the output p of a convex optimization procedure, and 
the extra Lp, term in (|2j prevents the use of conjugate du¬ 
ality relationships available for the exponential family. We 
could have used automatic methods to differentiate the it¬ 
erative inference procedure (Stoyanov et al.|[201 l||Dornke| 


2012 1 , but found our learning algorithm works well. 


We employ a variational learning algorithm, presented in 
Algorithm[2j alternately updating the parameters M of our 
tractable CRF-structured variational distributions, and up¬ 
dating the parameters {9,tf>) assuming the following surro¬ 
gate likelihood given by these CRF approximations: 


L(9,tp-M) = Y^^ogQ(y i ;p i ). (12) 

i 

Given 9 and tj), we update M using Algorithm [T] Given 
M, we update 9 and tj) by taking a single step in the di¬ 
rection of the gradient of the surrogate likelihood ([TJJ. We 
avoid taking more than one gradient step, since the gradi¬ 
ents for 9 and tj) depend on M and an update to 9 and tj) 
will break the property that p (Q(y: p^) = p r Therefore, 
we recompute //, every time we update the parameters. 


Overall, it remains to show how to compute gradients 
of ( fl2| i. For 9, we have the standard CRF likelihood gradi¬ 
ent ( Sutton & McCallum]|2006| : 

V g L{9,tj)-M) = Y J S{y i )-p i . (13) 

i 

For tj), we have: 

V^L(9,ij>;M) = Y^^-^-logQiy^Pi). (14) 


We employ the notation Q(y, . p j) to highlight the role of 
pp. for a given (y, , x,) pair, the family of variational distri¬ 
butions over y, is indexed by possible values of fi, (recall 
we suppress the explicit dependence of 9 and tj) on x). Fi¬ 
nally, define the shorthand M = {pi ,..., p n }. 


From {TT}, logQjyp p^ is also S{yi) - p t and 


dpi __ d_d_ T , x 
dtp dtj> dp ^ ^ 


( 15 ) 
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Clearly, this depends on the structure of Consider the 
parametrization 0- With this, we have: 

WjdH L ^ = (16) 

Therefore, we have log Q(yi \/rj = 

Vfj(/r)^£j(/r) T (S'(y) —/rj. For linear measure¬ 
ments 0, this amounts to 

V^Qw) (aJS(y) - ajm) . (17) 

This has a simple interpretation: the gradient with respect 
to ipj equals the gradient of the scalar loss £j at the current 
marginals fij times the difference in linear measurements 
between the ground truth labels and the inferred marginals. 

Algorithm [2] has an expensive double-loop structure. In 
practice it is sufficient to employ a ‘doubly-stochastic’ ver¬ 
sion given in Algorithm[3] where we sample a training ex¬ 
ample (x,;, y, : ) and use this to only perform a single gra¬ 
dient step on 9 and ip. To demonstrate the simplicity of 
implementing our learning algorithm, we avoid any ab¬ 
stract derivative notation in Algorithm [3] by specializing 
it to the case of ( fl7j ). In our experiments, however, we 
sometimes do not use linear measurements. Overall, all 
our experiments use the fast doubly-stochastic approach of 
Algorithm [3] solely, since it performs well. In general, our 
learning algorithms are not guaranteed to converge because 
we approximate the complex interaction between ip and fj, 
with alternating updates. In practice, however, terminating 
after a fixed number of iterations yields models that gener¬ 
alize well. 

Finally, recall that the notation L^(fi i ) suppresses the po¬ 
tential dependence of ip on x,. We assume each ip^ is a 
differentiable function of features of x,. Therefore, in our 
experiments where ip depends on x,. we perform gradient 
updates for the parametrization of ippx.) via further appli¬ 
cation of the chain rule. 
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Figure 1: Citation extraction FI when limiting maximum 
number of test-time inference iterations. Most of our accu¬ 
racy gain is captured within the first 5-10 iterations. 


We first apply our algorithm to the NLP task of performing 


text field segmentation on the UMass citation dataset (An- 


zaroot & McCallum, 2013)l, which contains strings of cita¬ 


tions from research papers, segmented into fields (author, 
title, etc.). Our modeling approach, closely follows Anza- 


root et al. (2014 1 , who extract segmentations using a linear- 


chain segmentation model, to which they add a large set of 
‘soft’ linear global regularity constraints. 


Let y be a candidate labeling. Imagine, for example, that 
we constrain predicted segmentations to have no more pre¬ 
dicted last names than first names. Then, the numbers of 
first and last names can be computed by linear measure¬ 
ments a f | rst S'(y) and a^iS^y), respectively. A hard con¬ 
straint on y would enforce “firstly) - aj st S{y) = 0 . This 
is relaxed in|Anzaroot et al.|(|2014|) to a penalty term 


c4 (al sl S(y) - aj st S(y)) (18) 


that is added to the MAP inference objective, where 
lh{x) = max(l — x , 0) is a hinge function. For multiple 
soft constraints, the overall prediction problem is 


argmin (—9, Spy)} + ^Cj4 (aJS(y)) , (19) 

y 


8 EXPERIMENTS 

8.1 CITATION EXTRACTION 


Model 

FI 

Our Baseline 

Non-local Energies 

94.47 

95.47 

Baseline (Anzaroot et al. 

2014) 

94.41 

95.39 

Soft-DD (Anzaroot et al. 

2014) 


Table 1: Comparison of FI scores on Citation Extraction 
dataset. We compare MAP inference FI scores of our non¬ 
local energy model and the specialized dual decomposition 
model of Anzaroot et al. ( 2014[ ). Both variants learn global 
regularities that significantly improve performance. 


where 9 are the parameters of the underlying linear-chain 
model. They use a dual decomposition style algorithm for 
solving m that crucially relies on the specific structure 
of the hinge terms 4- They learn the Cj for hundreds of 
‘soft constraints’ using a perceptron-style algorithm. 


We consider the same set of measurement vectors aj, but 
impose non-local terms that act on marginals fx rather 
than specific values y. Further, we use smoothed hinge 
functions, which improve the convergence rate of infer¬ 
ence (Rennie 2005) 1. We find the variational distribution by 
solving the marginal inference version of ( |T9| , an instance 
of our inference framework with linear measurements |5]): 


argmin (~9,fi) - H B (n) + ^ Cjl h (aj/x) , (20) 






































As in Anzaroot et al. ( 2014) 1, we first learn chain CRF pa¬ 
rameters 6 on the training set. Then, we learn the Cj param¬ 
eters on the development set, using Algorithm [3] and tune 
hyperparameters for development set performance. At both 
train and test time, we ignore any terms in ([20]) for which 


Cj < 0 . 


We present our results in Table [T] measuring segment- 
level FI. We can see that our baseline chain has slightly 


higher accuracy than the baseline approach of Anzaroot 
et al. ( |2014) >, possibly due to optimization differences. Our 
augmented model (Non-Local Energies) matches and very 
slightly beats their soft dual decomposition (Soft-DD) pro¬ 
cedure. This is especially impressive because they employ 
a specialized linear-programming solver and learning al¬ 
gorithm adapted to the task of MAP inference under hinge- 
loss soft constraints, whereas we simply plug in our general 
learning and inference algorithms for non-local structured 
prediction - applicable to any set of energy functions. 

Our comparable performance provides experimental evi¬ 
dence for our intuition that preferences about MAP con¬ 
figurations can be expressed (and “relaxed”) as functions 


of expectations. Anzaroot et al. (2014) solve a penalized 
MAP problem directly, while our prediction algorithm first 
finds a distribution satisfying these preferences, and then 
performs standard MAP inference in that distribution. 

Finally, in Figure |T] we present results demonstrating that 
our algorithm’s high performance can be obtained using 
only 5-10 calls per test example to inference in the under¬ 
lying chain model. In Section[B] we analyze the empirical 
convergence behavior of Algorithm|T] 


8.2 HANDWRITING RECOGNITION 


N-Grams 

2 

3 

4 

5 

6 

Accuracy 

85.02 

96.20 

97.21 

98.27 

98.54 


Table 2: Character-wise accuracy of Structured Prediction 
Cascades ( |Weiss et al.||2012i ) on OCR dataset. 


Accuracy 


3- gram 

4- gram 

5- gram 

6- gram 


85.02 

96.20 

97.21 
98.27 
98.54 


Table 3: Character-wise accuracy of our baselines, and 
models using learned non-local energies on Handwriting 
Recognition dataset. Note that word classifier baseline is 
also given in character-wise accuracy for comparison. 


We next apply our algorithms to the widely-used handwrit¬ 
ing recognition dataset of Taskar et al. (2004|. We follow 


Model 

Accuracy 

2-gram (base model) 

Til 

(MM) 

T W 

LI (MM) 

84.93 

94.01 

94.96 

98.26 

98.83 

55-Class Classifier (MM) 

86.06 


Table 4: Character-wise accuracy of our baselines, and 
models using learned non-local energies on Handwriting 
Recognition dataset. Note that word classifier baseline is 
also given in character-wise accuracy for comparison. 


the setup of Weiss et al. ( 2012} , splitting the data into 10 
equally sized folds, using 9 for training and one to test. We 
report the cross-validation results across all 10 folds. 


The structured prediction cascades of Weiss et al. ( |2012| 
achieve high performance on this dataset by using ex¬ 
tremely high order cliques of characters (up to 6-grams), 
for which they consider only a small number of candi¬ 
date outputs. Their state-of-the-art results are reproduced 
in Table[2] The excellent performance of these large-clique 
models is consequence of the fact that the data contains 
only 55 unique words, written by 150 different people. 
Once the model has access to enough higher-order context, 
the problem becomes much easier to solve. 


With this in mind, we design two non-convex, non-local 
energy functions. These energies are intended to regularize 
our predictions to lie close to known elements of the vo¬ 
cabulary. Our base model is a standard linear-chain CRF 
with image features on the nodes, and no features on the 
bigram edge potentials. Let U{p) = be a func¬ 

tion that takes the concatenated vector of node and edge 
marginals and sums up all of the node marginals, giv¬ 
ing the global unigram expected sufficient statistics. Let 
{ui} = {U indicate the set of all such unique vec¬ 
tors when applying U to the train set empirical sufficient 
statistics for each data case y ,. Simply, this gives 55 vec¬ 
tors Ui of length 26 containing the unigram counts for each 
unique word in the train set. 


Our intuition is that we would like to be able to “nudge” 
the results of inference in our chain model by pulling the 
inferred U (//) to be close to one of these global statistics 
vectors. We add the following non-convex non-local en¬ 
ergy function to the model: 


= i/>min ||Ui - U(p)\\ 


1 - 


( 21 ) 


We learn two variants of this model, which differently 
parametrize the dependence of i/> on x. The first has a 
single bias feature on the non-local energy. The second 
conditions on a global representation of the sequence: con¬ 
cretely, we approximate the RBF kernel mean map (MM) 
(Smola et al. 2007)1 using random Fourier features (RFF) 


























































(Rahimi & Recht 2007). This simply involves multiplying 

s 

625 

10k 

50k 

each image feature vector in the sequence by a random ma¬ 
trix with ~ 1000 rows, applying a pointwise non-linearity. 

Our Method 

0.19 

2.7 

14 

IP 

2.8 

93 

690 


and taking t/j to be a linear function of the average vector. 


Results of these experiments can be seen in Table [4] 
Adding the non-local energy brings our performance well 
above the baseline bigram chain model, and our training 
procedure is able to give substantially better performance 
when ip depends on the above input features. 


Table 5: Comparison of runtime (in seconds, averaged over 


10 trials) between the interior point solver (IP) of Sheldon 


et al. (2013)> v.s. Algorithm |T| on different CGM problem 


sizes s, the cardinality of the edge potentials in the under¬ 
lying graphical model, where marginal inference is O(s). 


The energy L^, based on unigram sufficient statistics, is 
not able to capture the relative ordering of letters in the vo¬ 
cabulary words, which the structured prediction cascades 
models do capture. This motivates us to consider another 
energy function. Let {w^} = {p n {yi)} be the set of unique 
vectors of concatenated node marginal statistics for the 
train set. This gives 55 vectors of length U * 26, where l t is 
the length of the zth distinct train word. Next, we define a 
different energy function to add to our base chain model: 


In previous work, CGMs have been successfully applied to 
modeling bird migration. Here, the base model is a lin¬ 
ear chain representing a time series of bird locations. Each 
observed variable corresponds to counts from bird watch¬ 
ers in different locations. These observations are assumed 
to be Poisson distributed with rate proportional to the true 
count of birds present. The CGM MAP task is to infer the 
underlying migration patterns. 


= ipm.m\\wi - //||i. (22) 

^ i 

Once again we implement featurized and non-featurized 
versions of this model. As noted in structured prediction 
cascades, giving the model access to this level of high- 
order structure in the data makes the inference problem ex¬ 
tremely easy. Our model outperforms the best structured 
prediction cascades results, and we note again an improve¬ 
ment from using the featurized over the non-featurized xp. 

Of course, since the dataset has only 55 actual labels, and 
some of those are not valid for different input sequences 
due to length mismatches, this is arguably a classification 
problem as much as a structured prediction problem. To 
address this, we create another baseline, which is a con¬ 
strained 55-class logistic regression classifier (constrained 
to only allow choosing output classes with appropriate 
lengths given the input). We use our same global mean- 
map features from the L ^ {MM) variants of the structured 
model and report these results in Table [4] We also tune the 
number of random Fourier features as a hyperparameter to 
give the classifier as much expressive power as possible. 
As we can see, the performance is still significantly below 
the best structured models, indicating that the interplay be¬ 
tween local and global structure is important. 


|Sheldon et al.| ( j2013| > demonstrate that MAP in CGMs is 
NP-hard, even for trees , but that approximate MAP can be 
performed by solving a problem of the form |2]): 

n 

H* = argmax ( 6 , p) + H B {p) + ^ Pi{Pi\ipyi) (23) 


where Pi are (concave) Poisson log-likelihoods and each 
y i is an observed bird count. 

For the case where the underlying CGM graph is a tree, the 


‘hard EM’ learning algorithm of Sheldon et al. (2013) is 
the same as Algorithm[2]specialized to their model. There¬ 


fore, Sheldon et al. (2013) provide additional experimen¬ 


tal evidence that our alternating surrogate-likelihood opti¬ 
mization works well in practice. 

The learning procedure of Sheldon et al. ( j2013] > is very 
computationally expensive because they solve instances 
of ( [23] ) using an interior-point solver in the inner loop. For 
the special case of trees. Algorithm [T]is directly applicable 
to Using synthetic data and code obtained from the 
authors, we compare their generic solver to Algorithm [I] 
for solving instances of ( |23| . In Table [5] we see that our 
method achieves a large speed-up with no loss in solution 
accuracy (since it solves the same convex problem). 


8.3 COLLECTIVE GRAPHICAL MODELS 


Next, we demonstrate that that our proximal gradient-based 
inference framework dramatically speeds up approximate 


inference in collective graphical models (CGMs) (Sheldon 


|& Dietterich[ |201 1| ). CGMs are a method for structured 
learning and inference with noisy aggregate observation 
data. The large-scale dependency structure is represented 
via a graphical model, but the nodes represent not just sin¬ 
gle variables, but aggregate sufficient statistics of large sets 
of underlying variables, corrupted by some noise model. 


9 DISCUSSION AND FUTURE WORK 

Our results show that our inference and learning frame¬ 
work allows for tractable modeling of non-local depen¬ 
dency structures, resistant to traditional probabilistic for¬ 
mulations. By approaching structured modeling not via in¬ 
dependence assumptions, but as arbitrary penalty functions 
on the marginal vectors p, we open many new modeling 
possibilities. Additionally, our generic gradient-based in¬ 
ference method can achieve substantial speedups on pre- 










































existing problems of interest. In future work, we will apply 
our framework to new problems and new domains. 
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Supplementary Material 


A Variational Approximation 

During learning, reasoning about P c (y|x) in ([8J is difficult, due to the intractability of Zg^. In response, we approximate 
it with a variational distribution: 


Q{ y) = arg minP(Q';x, 9, ip), 
Q' 


(24) 


where 

F(Q') = KL(Q'(y)\\P c (y\x)) 

= -H{Q')-¥, Q ,[{e,S{y))}+W, Q ,[L^{S{y))\ (25) 

» -H(Q') - (6, p(Q')) + M/*(Q'))- (26) 

Given x, 0, and ij>, we select Q by minimizing the approximation ( |26| ). Note that the surrogate we minimize is a lower 
bound to ( |25] >, as EQ/[Z/^,(5(y))] > by Jensen’s inequality and the convexity of L. This differs from many 

mean-field variational inference approaches that minimize an upper bound. 

So far, we have not assumed any structure on Q. Next, we show that the minimizer of ( |26[ > is a MRF with the same clique 
structure as Pg. This provides an alternative derivation of the techniques in Section[4] 

Let q y denote the probability under Q of a given joint configuration y. There are exponentially-many such q y , and U (Q) 
is the entropy on the simplex — ]Tj y q y log(< 7 y ). Since Q minimizes ([26]), we have the following stationarity condition for 
every q y : 


-j- [—H(Qcf>) - q y log(P 0 (y|x)) + L^(fj,(Q^))] + X = 0 (27) 

aq y 

Here, A is a dual variable for the constraint f2 y q y = 1. Rearranging, we have: 

Q(y) = (28) 

(l/Z)P g (y\x)exp , (29) 

where Z is a normalizing constant. 

Proposition 5. There exists a vector p such that the quantity (s^-MMQ))) (4^(^)) = P Ts (y)f oral1 9 y - Fur¬ 

thermore, p is a simple, closed-form function of fi(Q). 

Proof We have = s (y)> since KQ) = E y Qy S ( y)- Therefore, p = □ 

Corollary 1. Since Pe(y|x) ex ( 6 , S(y)), Proposition [5] implies Q(y) is an MRF with the same clique decomposition as 

Pe{ y|x). 

So far, Q is implicitly defined in terms of its own marginals p{Q). Since we assume Pg and If, have the same sufficient 
statistics S{ y), we can use the Bethe entropy representation H(Q) = TTg (pUf)). This transforms ( [26] ) to the augmented 
inference problem |2]). Therefore, we can directly solve for p(Q), which can then be used to provide a closed-form 
expression for the CRF distribution Q. 

B Additional Experiments 

In Figure [2] we examine the convergence behavior of our algorithm on the citation dataset. This demonstrates that our 
inference procedure converges quite quickly except for a small number of difficult cases, where the global energy and the 
local evidence are in significant disagreement. 


95.6 



Figure 2: The number of iterations taken for inference to converge on test set citations, as a percentage of the total number 
of test cases. Number of iterations is capped at 40. We can see that the distribution is long tailed. Inference converges 
within 40 iterations for 93.7 of examples, and each example takes an average of 9.8 iterations to converge. 


Algorithm 4 Bethe-MD 

Input: parameters 0 , energy function L(fi). learning rate sequence { r / t } 
set //(| to prox-center MARGIN AL-ORACLE(0) 

repeat 

9t = + TjtVL(nt-i) 

IM = MARGINAL-ORACLE^ (^0 - g t )) 
until CONVERGED (g tl p t - i) 


C Non-Convex Energies and Composite Mirror Descent 


We introduce a small modification of Algorithm [T] along with a rough proof sketch of its convergence even in the case of 
non-convex energy functions. Because it leans heavily on significant prior work in optimization, it is hard to give a self- 
contained proof of the results in this section, and our argument takes the form of a proof sketch that appeals to these other 
works. However, the basic argument simply combines the strong convexity of iTg and its associated Bregman divergence, 
along with the results of|Mairal ( 2013] ) for the case of composite minimization of non-convex functions using the Euclidean 
Bregman divergence, and the fact that the local updates performed using entropy i?g as a distance-generating function have 
a log-barrier function for the constraint set Af, effectively bounding the norm of the gradient of fTg when restricted to the 
set of iterates actually visited during optimization. 

While Algorithm [I] was built on the framework of regularized dual averaging (RDA), we introduce a slightly different 
formulation based on composite mirror descent (COMID) (Duchi et al. 2010| >. Like RDA, COMID is a gradient method 
for minimizing functions of the form h = f + R. At each time step t, COMID makes the update 


w t+ i = argmin (\7 f(w t ),w) H- B v (w,w t ) + R{w) (30) 

w T]t 

where ip is some strongly convex function and B v is its associated Bregman divergence. In Algorithm [4] we present an 
instantiation of composite mirror descent for our inference problem. 

At first glance, this seems significantly different from our original Algorithm [I] but remembering that VfTg(/r t ) = G t 
because of conjugate duality of the exponential family, we can see that it actually only a corresponds to a slight re-weighting 
of the iterates of Algorithm [T] 

First, we give Algorithm[4]similar guarantees in the convex setting as we did for Algorithm[T] 

Proposition 6. For convex energy functions and convex — Hb, given the learning rate sequence i/t = where A is the 
strong convexity of —Hg, the sequence of primal averages of Algorithm^converges to the optimum of the variational 
objective (]2]i with suboptimality ofO^-^p-) at time t. 

















Proof. This follows from a standard online-to-batch conversion, along with the strong convexity of Hr and Theorem 7 of 
[Duchi et ~aT7| ( |2010] ). □ 


Now, having introduced composite mirror descent in ( |30| ), will lean heavily on the framework for optimization with first- 
order surrogate losses of Mairal ( 2013|) to show that these types of algorithms should converge even in the non-convex 
case. We now recall a few definitions from that work. 


First, we define the asymptotic stationary point condition, which gives us a notion of convergence in the non-convex 
optimization case. 


Definition 1 (Asymptotic Stationary Point (Mairal 2013)). 
say it satisfies an asymptotic stationary point condition if 


Fora sequence {6 n }n> 0. cmd differentiable function f, we 


lim ||V/(0„)|| 2 = O 

n —>-+oo 


We call a function L-strongly smooth if L is a bound on the largest eigenvalue of the Hessian - this tells us how the norm 
of the gradient changes. This is also known as a L-Lipschitz continuous gradient. Now we recall the notion of a majorant 
first-order surrogate function. 


Definition 2 (Majorant First-Order Surrogate (Mairal 20131). A function g: 
of f near k when the following conditions are satisfied 


is a majorant first-order surrogate 


• Majorant: we have g > /. 

• Smoothness: the approximation error h = g—f is differentiable, and its gradient is L-Lipschitz continuous, moreover, 
we have h(tf) = 0 and SJh{n) = 0 


We denote by Sr (/. n) the set of such surrogates. 


Now we recall the majorant first-order surrogate property for the composite minimization step in the case of Euclidean 
Bregman divergence (Euclidean distance). 


Proposition 7 (Proximal Gradient Surrogates (Mairal 2013| >). Assume that h = f + R where f is differentiable with an 
L-Lipschitz gradient. Then, h admits the following majorant surrogate in <S 2 i(/, k ) : 


g{9) = f (ac) + V/(ac) t (0 - k) + ^\\6 - «||! + R{6) 


(31) 


We can use this result to establish a majorant property for the composite mirror descent surrogate ( |30| ) given a strongly 
convex and strongly smooth Bregman divergence. 

Proposition 8 (Composite Mirror Descent Surrogates). Assume that h = f + R where f is differentiable with an L- 
Lipschitz gradient, ip is a o-strongly convex and y- strongly smooth function, and B v is its Bregman divergence. Then, h 
admits the following majorant surrogate in Sr+rx (/, k): 

g(0) = /(«) + V/(«) T (0 - At) + At) + R(9) (32) 

Proof. By the definition of strong convexity and the Bregman divergence, ( [32] ) upper bounds ( |3T| ), so it is a majorant of h. 
Additionally, by the additive property of strong smoothness, we get the strong smoothness constant for the surrogate. □ 


However, small technical conditions keep Proposition [8] from applying directly to our case. The Bethe entropy Hr, and 
thus its associated Bregman divergence, is not strongly smooth - its gradient norm is unbounded as we approach the corners 
of the marginal polytope. However, it is locally Lipschitz - every point in the domain has a neighborhood for which the 
function is Lipschitz. In practice, since the —Hr mirror descent updates have a barrier function for the constraint set AL 
our iterative algorithm will never get too close to the boundary of the polytope and it is effectively strongly smooth for 
purposes of our minimization algorithm. This is not a rigorous argument, but is both intuitively plausible and born out in 
experiments. 






















Algorithm 5 Accelerated Bethe-RDA 

Input: parameters 9, energy function L(p.) 
set //(| to prox-center MARGINAL-ORACLE(0) 
set vq = fi o 

3o = 0 

repeat 

u t = (1 - c t )nt-i +ctv t -i 

gt = (l - Cf)p*-i + C(Vi(it t ) 

V t = MARGINAL-ORACLE) ) {0 ~ 9t )) 

IH = (1 - c t )nt-1 + CtV t 
until CONVERGED(/r t , p, t ~i) 


Proposition 9. The sequence of iterates Wt from Algorithm [7] when bounded away from the corners of the marginal 
polytope constraint set At, and for appropriate choice of learning rates {r/ t }, convex —H&, and L-strongly smooth (but 
possibly non-convex) energy function L^, satisfies an asymptotic stationary point condition. 


Proof. This follows from application of Proposition [8j and noting that Algorithm [4] corresponds to the generalized 
surrogate-minimization scheme in Algorithm 1 of Mairal ( |2013| . The asymptotic stationary point condition then fol¬ 
lows from Proposition 2.1 of Mairal (2013). The appropriate learning rates {?y t } must be chosen by the Lipschitz constant 
of the gradient of L r j,, as well as the effective Lipschitz constant of the gradient of // B , given how far we are bounded from 
the edge of the constraint set (this effective smoothness constant is determined by the norm of our parameter vector 6). □ 


In this section we have given a rough proof sketch for the asymptotic convergence of our inference algorithms even in the 
case of non-convex energies. Our heuristic argument for the effective smoothness of the entropy II$ is the most pressing 
avenue for future work, but we believe it could be made rigorous by examining the norm of the parameter vector and how 
it contributes to the “sharpness” of the barrier function for the mirror descent iterates. 


D Accelerated Bethe-RDA 


If we have L-strongly smooth losses (L is a bound on the largest eigenvalue of the Hessian), we can use an accelerated dual 
averaging procedure to obtain an even faster convergence rate of O(^). Let I) be the diameter of the marginal poly tope 
as measured by the strongly convex distance-generating function //;,■ (using its associated Bregman divergence.) Then 
Algorithm[5]gives us a convergence rate of ALI) 1 /t 1 by Corollary 7 of i 


Xiao 


( 2010 ). 


















